“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course will be drawn from real-world datasets and the techniques learned herein aim to be broadly applicable to multiple fields.
This lesson is the fifth in a 6-part series. The aim of this series is for students to learn how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.
This week will focus on exploring the relationships within your data: comparing observations between experimental samples and applying a suite of clustering, dimension-reduction, and projection algorithms.
At the end of this lecture you will have covered visualizations related to:
- grey background - a package, function, code, command, or directory. Backticks are also used for in-line code.
- italics - an important term or concept, or an individual file or folder
- bold - a heading or a term that is being defined
- blue text - a named or unnamed hyperlink
- ... - within each coding cell, this will indicate an area of code that students will need to complete for the code cell to run correctly.
- Blue box: a key concept that is being introduced
- Yellow box: a risk or caution
- Green box: recommended reads and resources to learn R
- Red box: a comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2025-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run them independently of the datatools Hub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.
Today’s datasets will focus on the smaller datasets we worked with in earlier lectures (namely our Ontario public health unit COVID-19 demographics data), and a new set of RNAseq analyses of different tissue samples from COVID-19 patients.
This is a combination of datasets from previous lectures: it incorporates PHU demographic data with StatsCan census data from 2017 to produce a normalized estimate of cases, deaths, and hospitalizations across age groups and public health units in Ontario.
This is the same readcount data we looked at in Lecture 04: RNA-Seq read count data generated from SARS-CoV-2 infections of AEC cells, used to compare the time course of (pro-inflammatory) expression changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151
From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset has normalized expression counts from RNAseq data. It covers multiple samples and tissues from COVID-positive patients, with a focus on lung tissue. The expression data is unfiltered and so retains SARS-CoV-2 expression as well.
From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset contains patient information like viral load from tissues that were used for RNAseq.
- tidyverse, which has a number of packages including dplyr, tidyr, stringr, forcats, and ggplot2
- viridis helps to create color-blind palettes for our data visualizations
- RColorBrewer has some helpful palettes that we’ll need to colour our data
- gplots will be used to help generate heatmap/dendrogram visualizations
- FactoMineR and factoextra will be used for PCA generation and visualization
- Rtsne and umap are packages implementing non-linear projection algorithms
# Some packages can be installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cran.r-project.org
## Bioconductor version 3.12 (BiocManager 1.30.22), R 4.0.5 (2021-03-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'ComplexHeatmap'
## Installation paths not writeable, unable to update packages
## path: C:/Program Files/R/R-4.0.5/library
install.packages("FactoMineR", dependencies = TRUE)
## Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning: dependency 'emmeans' is not available
## also installing the dependencies 'FactoInvestigate', 'missMDA', 'Factoshiny'
##
## There are binary versions available but the source versions are later:
##                  binary source needs_compilation
## FactoInvestigate    1.7    1.9             FALSE
## missMDA            1.18   1.19             FALSE
## Factoshiny          2.4    2.7             FALSE
## FactoMineR          2.4   2.11              TRUE
## installing the source packages 'FactoInvestigate', 'missMDA', 'Factoshiny', 'FactoMineR'
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'FactoMineR' had non-zero exit status
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'FactoInvestigate' had non-zero exit status
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'missMDA' had non-zero exit status
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'Factoshiny' had non-zero exit status
install.packages("factoextra", dependencies = TRUE)
## Warning: package 'factoextra' is in use and will not be installed
install.packages("Rtsne")
## Warning: package 'Rtsne' is in use and will not be installed
install.packages("umap")
## Warning: package 'umap' is in use and will not be installed
# ------------ Legacy installation code which we hopefully never need again ----------#
# We need to specifically install a package called pbkrtest for the facto packages
# This is due to using an older version of R
# library(remotes)
# install_version("pbkrtest", "0.5.1")
# Packages to help tidy our data
library(tidyverse)
library(readxl)
library(magrittr)
# Packages for the graphical analysis section
library(viridis)
# library(gplots) # heatmap2()
library(ComplexHeatmap) # Heatmap()
library(RColorBrewer)
# Useful for PCA and PCA visualization
library(FactoMineR)
## Error in library(FactoMineR): there is no package called 'FactoMineR'
library(factoextra)
# Data projection packages
library(Rtsne)
library(umap)
Last week we looked at an analysis of RNAseq data through a number of methods starting with broad-level volcano plots and moving towards gene-level expression visualizations with dotplots. In between we stopped to take a look at heatmaps. In this instance we simply used heatmaps to convey expression levels of multiple genes across one or more samples.
Looking more broadly, we now wish to ask questions such as “how similar are our samples?”, “can our samples be grouped or categorized in some way?” and “is there an underlying structure or architecture to our data?” We briefly discussed scatterplot matrices to help analyse our sample quality among replicate experiments.
We can study these questions using a number of techniques that range from clustering/categorization to projection of high-dimensional data onto a lower set of dimensions (usually 2 or 3).
We cannot begin our journey until we talk about the nature of the data we are interested in examining. Usually, our data will consist of many samples (observations) with some number of features (variables) captured about each observation. For example, with RNAseq data we could consider the measurements we capture for each gene as a separate feature/variable/column.
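As a minimal sketch of this observations-by-features layout (made-up numbers, not course data), each sample is a row and each feature is a column, and base R’s dist() then measures how far apart any two observations sit in feature space:

```r
# A toy observations-x-features matrix: 3 samples measured on 4 features
toy.mx <- matrix(c(1, 2, 3, 4,
                   1, 2, 3, 5,
                   10, 9, 8, 7),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("sample_A", "sample_B", "sample_C"),
                                 paste0("feature_", 1:4)))

# Each row is a point in 4-dimensional feature space;
# dist() returns the pairwise Euclidean distances between rows
dist(toy.mx)
# sample_A and sample_B differ only in feature_4, so they sit closest together
```

The same reasoning applies to our real datasets, just in many more dimensions.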
Conversely you may have hundreds or thousands of samples you’d like to categorize in some way (or show that you can categorize) with just a smaller set of features. For every nut, there is a tool of sorts for cracking it!
We’ll start with a COVID dataset collected by Ontario. It consists of a collection of demographic case data from across 34 public health units (PHUs) from January 2020 to March 2022. Observations are broken down by PHU and age group. The demographics data has been modified to include a set of normalized values (individuals per 100k) that was calculated based on StatsCan 2017 census data. Modeling off of these data, the 2022 sizes for each age group were estimated. Case, death, and hospitalization counts were normalized based on these subgroup sizes to generate values per 100K individuals. We’ll be working with this 2022 dataset mainly because it utilizes some fine-grained binning of age-groups and is an appropriate size for first investigating the idea of clustering data.
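The per-100K normalization itself is simple arithmetic. As a quick sketch (using values that match the Algoma “0 to 4” row of the dataset), each cumulative count is divided by the estimated 2022 subgroup population and scaled to 100,000 individuals:

```r
# Per-100K normalization, sketched with the Algoma "0 to 4" values
total_cases      <- 181    # cumulative cases in this age group
individuals_2022 <- 5481   # estimated 2022 population of this age group

cases_per_100k <- total_cases / individuals_2022 * 100000
round(cases_per_100k)
# [1] 3302
```

This matches the cases_per_100k value we’ll see for that row when we inspect the data.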
The dataset will be found in
phu_demographics_census_norm.csv and we’ll use it to guide
us through two sections of today’s lecture.
Let’s begin by loading the data and taking a look at its structure.
# Import the normalized demographics data
covid_demographics_norm.df <- read_csv("data/phu_demographics_census_norm.csv")
## Rows: 238 Columns: 15
## -- Column specification --------------------------------------------------
## Delimiter: ","
## chr  (4): from_date, to_date, public_health_unit, age_group
## dbl (11): total_cases, total_deaths, total_hospitalizations_count, percent_c...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Look at its structure
str(covid_demographics_norm.df, give.attr = FALSE)
## spc_tbl_ [238 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ from_date : chr [1:238] "January 15, 2020" "January 15, 2020" "January 15, 2020" "January 15, 2020" ...
## $ to_date : chr [1:238] "March 2, 2022" "March 2, 2022" "March 2, 2022" "March 2, 2022" ...
## $ public_health_unit : chr [1:238] "Algoma" "Algoma" "Algoma" "Algoma" ...
## $ age_group : chr [1:238] "0 to 4" "12 to 19" "20 to 39" "40 to 59" ...
## $ total_cases : num [1:238] 181 412 1868 1424 379 ...
## $ total_deaths : num [1:238] 0 0 1 5 0 14 13 0 0 2 ...
## $ total_hospitalizations_count: num [1:238] 9 5 31 48 0 69 44 10 2 35 ...
## $ percent_cases : num [1:238] 0.0343 0.0781 0.3543 0.2701 0.0719 ...
## $ percent_deaths : num [1:238] 0 0 0.0303 0.1515 0 ...
## $ percent_hospitalizations : num [1:238] 0.0437 0.0243 0.1505 0.233 0 ...
## $ population_2022 : num [1:238] 117840 117840 117840 117840 117840 ...
## $ individuals_2022 : num [1:238] 5481 9230 24676 33111 7751 ...
## $ cases_per_100k : num [1:238] 3302 4464 7570 4301 4890 ...
## $ deaths_per_100k : num [1:238] 0 0 4 15 0 46 177 0 0 5 ...
## $ hospitalizations_per_100k : num [1:238] 164 54 126 145 0 228 598 116 13 95 ...
Let’s begin looking at our covid_demographics_norm.df. Using a series of data sets, we’ve created a consolidated dataset with:

- Cumulative cases, deaths, and hospitalizations due to COVID-19 within each age group per PHU.
- Representation of each age group within each data category as a percent of total incidents in each PHU.
- Using 2017 census data, the number of cases per 100,000 individuals normalized by estimated population size for each age group within each PHU.
The question we want to answer is: of the 34 public health units, which look most similar based on the normalized case data for each age group? In order to visualize this data, we’ll want to convert our current long-form data that looks something like this:
| public_health_unit | age_group | total_cases | … | cases_per_100k | deaths_per_100k | hospitalizations_per_100k |
|---|---|---|---|---|---|---|
| Algoma | 0 to 4 | 181 | … | 3302 | 0 | 164 |
| Algoma | 12 to 19 | 412 | … | 4464 | 0 | 54 |
| … | … | … | … | … | … | … |
Into something like this:
| public_health_unit | population_2022 | category | 0 to 4 | 5 to 11 | 12 to 19 | 20 to 39 | 40 to 59 | 60 to 79 | 80+ |
|---|---|---|---|---|---|---|---|---|---|
| Algoma | 117840 | cases | 3302 | 4464 | 7570 | 4301 | 4890 | 2331 | 4116 |
| … | … | … | … | … | … | … | … | … | … |
We need to do the following to the dataset:
# Create a wide-format version of our normalized data
covid_demographics_norm_wide.df <-
# Start by passing along the long-form normalized data
covid_demographics_norm.df %>%
# Select for just the PHU, age_group, population size and cases/hosp/deaths_per_100k
dplyr::select(c(3,4,11,13:15)) %>%
# Pivot the data to a longer format
pivot_longer(cols = -c(1:3), names_to = "category", values_to = "per_100k") %>%
# Get rid of the suffix of "_per_100k"
mutate(category = str_remove_all(string = category, pattern = "_per_100k")) %>%
# Pivot the age_group/per_100k data out wider
pivot_wider(names_from = age_group,
values_from = per_100k
) %>%
# Move the "5 to 11" category to after "0 to 4". You could use a longer "select" call to do this too!
relocate(., `5 to 11`, .after=`0 to 4`)
# Take a look at our resulting dataframe
head(covid_demographics_norm_wide.df)
## # A tibble: 6 x 10
##   public_health_unit population_2022 category      `0 to 4` `5 to 11` `12 to 19`
##   <chr>                        <dbl> <chr>            <dbl>     <dbl>      <dbl>
## 1 Algoma                      117840 cases             3302      4890       4464
## 2 Algoma                      117840 deaths               0         0          0
## 3 Algoma                      117840 hospitalizat~      164         0         54
## 4 Brant County                153558 cases             4481      5376       6499
## 5 Brant County                153558 deaths               0         0          0
## 6 Brant County                153558 hospitalizat~      116        15         13
## # i 4 more variables: `20 to 39` <dbl>, `40 to 59` <dbl>, `60 to 79` <dbl>,
## #   `80+` <dbl>
At this point, we would normally be prepared to visualize our data with ggplot, but our data is in wide-format and we’ll be using a package that prefers to work with matrix data. In that case, we need to strip the data down further, because a matrix must be all of the same type and we want to work with numeric data! We’ll use as.matrix() for our conversion. Specifically, we will:

- Filter for just the case data and drop the category variable from it.
- Move the public_health_unit information over to the row names of our matrix so we can still track our data properly.
- Keep the population_2022 column of data, but we’ll have to remember that it’s there.

# Now we need to make our matrix and assign row names.
# It's kind of awkward to need this requirement.
# Cast our matrix and drop the first column
covid_demographics_norm.mx <- covid_demographics_norm_wide.df %>%
# Filter for just case data
dplyr::filter(category == "cases") %>%
# Drop the PHU and category data
dplyr::select(-c(public_health_unit, category)) %>%
# Convert to a matrix
as.matrix()
# Set the row names using the information from the data frame
rownames(covid_demographics_norm.mx) <-
covid_demographics_norm_wide.df %>%
# Pull the PHU names
pull(public_health_unit) %>%
# Create a unique vector of them
unique()
# Take a peek at our data
head(covid_demographics_norm.mx)
## population_2022 0 to 4 5 to 11
## Algoma 117840 3302 4890
## Brant County 153558 4481 5376
## Durham Region 711426 4669 5697
## Grey Bruce 176150 2324 3040
## Haldimand-Norfolk 120008 3002 4956
## Haliburton, Kawartha, Pine Ridge District 190728 2263 2865
## 12 to 19 20 to 39 40 to 59 60 to 79
## Algoma 4464 7570 4301 2331
## Brant County 6499 9448 6429 3995
## Durham Region 6407 11018 7333 4985
## Grey Bruce 4643 5995 3489 2022
## Haldimand-Norfolk 5388 9803 5799 3568
## Haliburton, Kawartha, Pine Ridge District 3673 7368 3681 1923
## 80+
## Algoma 4116
## Brant County 5684
## Durham Region 7562
## Grey Bruce 3056
## Haldimand-Norfolk 6261
## Haliburton, Kawartha, Pine Ridge District 3612
Heatmap()

From the ComplexHeatmap package we can use the function
Heatmap() to generate a heatmap that can do a little more
than what we were doing with our ggplot2 version. A nice
aspect of using Heatmap() is that we can ask it to reorder
our samples along both the rows and columns. At the same time we can
present the relationship between columns elements or row elements in the
form of dendrograms.
In its current form, we have ordered our data along the columns in an ascending ordinal arrangement. While this makes sense to us now, it may not help to visually identify the strongest trends or groupings in our data. Clustering attempts to bring order to our data by grouping it according to specific algorithms.

In general, single points are grouped together as neighbours, with the “nearest” neighbours determined by a metric of some kind. These clusters are then further grouped using the same metrics until the entire set is presented in these ordered groups. These groupings/relationships can also be presented as dendrograms. In our current case, we aren’t concerned with determining groups per se, but rather with connecting them by their similarity and then creating a hierarchical order.
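A minimal sketch of that process (toy 2-D points, not the PHU data): dist() supplies the metric, hclust() repeatedly merges the nearest groups, and the resulting tree can be cut into clusters or drawn as a dendrogram:

```r
# Five toy observations in 2 dimensions: two tight pairs plus one stray point
toy.mx <- matrix(c(1.0, 1.0,
                   1.2, 0.9,
                   5.0, 5.0,
                   5.1, 4.8,
                   9.0, 1.5),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(paste0("obs_", 1:5), c("x", "y")))

# Hierarchical clustering on pairwise Euclidean distances
# (complete linkage by default)
hc <- hclust(dist(toy.mx))

# Cutting the tree into two groups separates the first pair from the rest
cutree(hc, k = 2)
# obs_1 obs_2 obs_3 obs_4 obs_5
#     1     1     2     2     2

# plot(hc) would display the corresponding dendrogram
```

This is the same machinery Heatmap() uses behind the scenes when it reorders rows and columns.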
Our data is usually coded in n-dimensional space, depending on the
nature of our dataset. For covid_demographics_norm.mx our 7
columns are coded by data from 34 PHUs meaning each column is coded in
34 dimensions or 34 features. Conversely, our 34 rows represent PHUs
each coded by data across 7 age groups and therefore 7 dimensions.
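To make that duality concrete (a stand-in matrix with arbitrary values, not the course data): clustering rows treats each row as a point in as many dimensions as there are columns, while clustering columns is the same operation applied to the transpose:

```r
# Stand-in matrix: 4 "PHUs" (rows) by 3 "age groups" (columns)
mx <- matrix(1:12, nrow = 4,
             dimnames = list(paste0("phu_", 1:4), paste0("age_", 1:3)))

# Row clustering: each of the 4 rows is a point in 3 dimensions
dim(as.matrix(dist(mx)))     # 4 x 4 pairwise distances

# Column clustering: transpose first, so each of the 3 columns
# is a point in 4 dimensions
dim(as.matrix(dist(t(mx))))  # 3 x 3 pairwise distances
```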
Important or helpful parameters to run Heatmap():

- matrix: our matrix object. It must be a matrix, and not a data frame. It could also be a vector, but that becomes a single column.
- col: determines the colours used for the image - nothing to do with column names.
- name: the name/title of the heatmap (also used as the legend title by default).
- row_*: set our row text properties, including titles and labels: row_title, row_title_side, row_title_gp (graphic properties), row_names_side, row_names_gp.
- column_*: the equivalent parameters for the columns.
- cluster_rows and cluster_columns: logical parameters that determine whether clustering should occur (default is TRUE).
- show_heatmap_legend: whether or not to show the heatmap legend.
- heatmap_legend_param: a list of legend settings; the legend title specifically is list(title = "x").
- show_[row/col]_dend: whether or not to show dendrograms for the rows/columns.
for the row/columnThere are actually a lot of options but these should help us make a basic one. Let’s plot our data with and without a dendrogram to compare.
?Heatmap
# Create a Heatmap object
cases_hmap <-
# Supply our matrix minus the populations size
Heatmap(covid_demographics_norm.mx[,-1],
cluster_rows = FALSE, cluster_columns = FALSE, # Don't cluster on either rows or columns
# Use column_title as the title of our heatmap
column_title = "Heatmap of COVID-19 cases in PHUs by age group: unclustered",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "cases per 100K individuals",
legend_direction = "horizontal"),
# Rotate column names to horizontal
column_names_rot = 0,
column_names_center = TRUE
)
# Plot the heatmap
ComplexHeatmap::draw(cases_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "bottom"
)
# Create a Heatmap object
cases_hmap <-
# Supply our matrix minus the populations size
Heatmap(covid_demographics_norm.mx[,-1],
cluster_rows = TRUE, cluster_columns = TRUE, # Cluster on both rows and columns
# Use column_title as the title of our heatmap
column_title = "Heatmap of COVID-19 cases in PHUs by age group: clustered",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "cases per 100K individuals",
legend_direction = "horizontal"),
# Rotate column names to horizontal
column_names_rot = 0,
column_names_center = TRUE
)
# Plot the heatmap
ComplexHeatmap::draw(cases_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "bottom"
)
HeatmapList object

Our heatmap above is drawn from a single dataset category - cases - but it helps us to see that there are some strong signals that differentiate between our different PHUs. This is a 34x7 grid where we can investigate all of the data in our dataset. What happens when we want to produce this kind of visualization for our other two metrics of hospitalizations and deaths?
To accomplish this, we can repeat our steps and create 3 separate
Heatmap objects but we can plot
them together as a single complex heatmap. In fact, rather than
store multiple heatmap objects, we’ll create a HeatmapList
object by concatenating together multiple Heatmap objects
using a for loop.
We’ll recreate our code from above but simplify the heatmap code a little. While we’re at it, we’ll also convert our colourscheme to viridis.
# Create a list of heatmap objects from our demographic data
hm_list <- NULL
# Create some quick vectors to help ourselves out
categories <- c("cases", "hospitalizations", "deaths")
# Store the rownames we want to use on our matrices
mx_rownames <- covid_demographics_norm_wide.df %>% pull(public_health_unit) %>% unique()
# Use a loop to separate the data by category
for (i in categories) {
# Create a temporary matrix of the specific category data
temp_mx <- covid_demographics_norm_wide.df %>%
dplyr::filter(category == i) %>%
dplyr::select(-c(public_health_unit, category)) %>%
as.matrix()
# Set the rownames
rownames(temp_mx) <- mx_rownames
# Create a Heatmap object
hmap <- Heatmap(temp_mx[,-1], # Supply our matrix minus the populations size
# Use a viridis colourscale broken into 100 segments
col = viridis(100),
# Use column_title as the title of our heatmap
column_title = i,
# Rotate the legend horizontally and give it a title based on category
heatmap_legend_param = list(title = paste0(i, " per 100K individuals"),
legend_direction = "horizontal")
)
# Add this to our heatmap list
hm_list <- hm_list + hmap
}
# What kind of object is hm_list?
class(hm_list)
## [1] "HeatmapList"
## attr(,"package")
## [1] "ComplexHeatmap"
draw() your HeatmapList

Now that we have our HeatmapList we can simply draw the
whole thing and it will automatically concatenate for us. Of course
there are many options we can use to display the list. One thing to note
is that the clustering of columns in each heatmap is
independent while the clustering of rows (and thus
order) is based on only the first heatmap (by default).
Using the ComplexHeatmap::draw() method to display your
HeatmapList will allow you to further customize details of
the overall figure including setting row_title and
column_title for the entire figure.
ComplexHeatmap::draw(hm_list,
column_title = "COVID-19 metrics breakdown by age group and PHU\n",
column_title_gp = gpar(fontsize = 20) # gpar() sets graphical parameters settings
)
Our heatmap above is drawn from a relatively simple dataset, but it helps us to see that there are some strong signals that differentiate between our different PHUs. It becomes clear that while the bulk of cases in each PHU is dominated by the 20-39 age segment, the bulk of hospitalizations and deaths originates from the 80+ segment.
Of course, this data is cumulative information from January 2020 to March 2022. Given the effects of vaccination and the different variants since then, a look at the current demographics may reveal a shift in these trends.
Considering that the heatmap is a 34x7 grid, it is easier to visualize and dissect all of the data in our dataset. What happens, however, when we want to produce this kind of visualization from something much larger or more complex?
Recall from last lecture that we investigated read count data from
Wyler et al., 2020 -
Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. We used this
dataset to produce a scatterplot matrix to compare between some of the
replicate data within the set. Across this set there are 36 sets of
RNA-Seq data spanning 27,011 genes.
Let’s begin by opening up the data file and getting a quick reminder of what it looks like.
# Read in your read_count data
wyler_readcounts.df <- read_tsv("./data/Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")
## Rows: 27011 Columns: 38
## -- Column specification --------------------------------------------------
## Delimiter: "\t"
## chr  (1): gene
## dbl (37): width, AECII_01_24h_DMSO-1, AECII_02_24h_DMSO-2, AECII_03_24h_DMSO...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(wyler_readcounts.df, give.attr = FALSE)
## spc_tbl_ [27,011 x 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gene : chr [1:27011] "A1BG" "A1BG-AS1" "A1CF" "A2M" ...
## $ width : num [1:27011] 1766 2130 9529 4653 2187 ...
## $ AECII_01_24h_DMSO-1 : num [1:27011] 22 55 0 0 0 ...
## $ AECII_02_24h_DMSO-2 : num [1:27011] 19 20 0 1 0 ...
## $ AECII_03_24h_DMSO-3 : num [1:27011] 15 38 0 0 2 ...
## $ AECII_04_24h_200nM17AAG-1 : num [1:27011] 29 11 0 0 2 4740 0 0 882 0 ...
## $ AECII_05_24h_200nM17AAG-2 : num [1:27011] 27 15 0 0 0 ...
## $ AECII_06_24h_200nM17AAG-3 : num [1:27011] 30 17 0 0 2 4220 0 0 820 0 ...
## $ AECII_07_24h_S2_DMSO-1 : num [1:27011] 24 41 0 0 0 ...
## $ AECII_08_24h_S2_DMSO-2 : num [1:27011] 18 46 0 0 2 ...
## $ AECII_09_24h_S2_DMSO-3 : num [1:27011] 22 42 0 0 4 2750 0 0 739 0 ...
## $ AECII_10_24h_S2_200nM17AAG-1: num [1:27011] 23 13 0 1 3 ...
## $ AECII_11_24h_S2_200nM17AAG-2: num [1:27011] 25 24 0 0 1 ...
## $ AECII_12_24h_S2_200nM17AAG-3: num [1:27011] 18 28 1 0 1 ...
## $ AECII_13_48h_DMSO-1 : num [1:27011] 24 57 0 0 0 ...
## $ AECII_14_48h_DMSO-2 : num [1:27011] 38 56 0 0 0 1930 0 0 892 0 ...
## $ AECII_15_48h_DMSO-3 : num [1:27011] 33 64 0 0 0 ...
## $ AECII_16_48h_200nM17AAG-1 : num [1:27011] 29 35 0 0 1 ...
## $ AECII_17_48h_200nM17AAG-2 : num [1:27011] 20 24 0 0 0 ...
## $ AECII_18_48h_200nM17AAG-3 : num [1:27011] 23 28 0 0 0 ...
## $ AECII_19_48h_S2_DMSO-1 : num [1:27011] 13 45 0 0 0 ...
## $ AECII_20_48h_S2_DMSO-2 : num [1:27011] 22 36 0 0 0 ...
## $ AECII_21_48h_S2_DMSO-3 : num [1:27011] 21 30 0 0 1 ...
## $ AECII_22_48h_S2_200nM17AAG-1: num [1:27011] 31 29 0 0 1 ...
## $ AECII_23_48h_S2_200nM17AAG-2: num [1:27011] 36 23 0 0 1 ...
## $ AECII_24_48h_S2_200nM17AAG-3: num [1:27011] 20 19 1 0 0 ...
## $ AECII_25_72h_DMSO-1 : num [1:27011] 35 43 0 1 2 ...
## $ AECII_26_72h_DMSO-2 : num [1:27011] 26 38 0 0 1 997 0 0 753 0 ...
## $ AECII_27_72h_DMSO-3 : num [1:27011] 29 53 0 0 4 1230 0 0 953 0 ...
## $ AECII_28_72h_200nM17AAG-1 : num [1:27011] 32 57 0 0 2 ...
## $ AECII_29_72h_200nM17AAG-2 : num [1:27011] 41 49 0 0 2 1050 0 0 553 0 ...
## $ AECII_30_72h_200nM17AAG-3 : num [1:27011] 33 37 0 0 3 857 0 0 458 0 ...
## $ AECII_31_72h_S2_DMSO-1 : num [1:27011] 24 58 0 0 1 ...
## $ AECII_32_72h_S2_DMSO-2 : num [1:27011] 33 63 0 0 3 ...
## $ AECII_33_72h_S2_DMSO-3 : num [1:27011] 23 47 1 1 2 ...
## $ AECII_34_72h_S2_200nM17AAG-1: num [1:27011] 22 39 0 1 1 ...
## $ AECII_35_72h_S2_200nM17AAG-2: num [1:27011] 28 26 0 0 2 866 0 0 481 0 ...
## $ AECII_36_72h_S2_200nM17AAG-3: num [1:27011] 43 34 0 0 3 ...
As you might recall from last week, there is quite a bit of data in this set. So that we can more easily visualize our data, we’ll want to wrangle it a bit more by:

- Renaming the columns to strip the sample-number prefix.
- Filtering out genes with very low or very high read counts.
- Converting the result to a matrix with gene names as row names.
Why do we filter our read counts? In some cases you might find wild swings in expression, and often that is exactly what we’re looking for. For in-class analysis, however, we limit the read counts to a narrow range to save a bit on memory and time and to reduce our set size. It will also greatly reduce the size of our heatmap for viewing. By filtering, we’ll go from 27,011 observations to 109.
wyler_readcounts_filtered.df <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
# filter out the low-readcount data
filter(if_all(.cols = -c(1,2), .fns = ~ .x > 1000 & .x < 3000))
# Create our data matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])
# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene
# Check the characteristics of our matrix
head(wyler_readcounts.mx)
## 24h_DMSO-1 24h_DMSO-2 24h_DMSO-3 24h_200nM17AAG-1 24h_200nM17AAG-2
## AGPAT3 2115 1558 1877 1973 1721
## AHCYL1 1878 1531 1707 2084 1912
## ANKLE2 2099 1401 1787 1296 1156
## ANKRD17 2017 1487 1900 2265 1983
## ANO6 2205 1563 1829 2112 1873
## AP1G1 2204 1649 2002 2175 1996
## 24h_200nM17AAG-3 24h_S2_DMSO-1 24h_S2_DMSO-2 24h_S2_DMSO-3
## AGPAT3 1907 1826 1613 1967
## AHCYL1 2114 1704 1619 1935
## ANKLE2 1202 1778 1761 2083
## ANKRD17 2205 1730 1699 2015
## ANO6 1971 1882 1828 2214
## AP1G1 1998 1780 1769 2120
## 24h_S2_200nM17AAG-1 24h_S2_200nM17AAG-2 24h_S2_200nM17AAG-3 48h_DMSO-1
## AGPAT3 1921 1661 1822 1905
## AHCYL1 2109 1845 2034 2370
## ANKLE2 1377 1336 1257 2056
## ANKRD17 2447 2367 2514 1967
## ANO6 1978 1847 1907 1985
## AP1G1 2094 1901 2069 1939
## 48h_DMSO-2 48h_DMSO-3 48h_200nM17AAG-1 48h_200nM17AAG-2
## AGPAT3 2066 1986 1885 1598
## AHCYL1 2739 2548 2257 1815
## ANKLE2 2320 2163 2013 1661
## ANKRD17 2108 2064 2303 2067
## ANO6 2203 2194 2512 2191
## AP1G1 2166 2019 1589 1349
## 48h_200nM17AAG-3 48h_S2_DMSO-1 48h_S2_DMSO-2 48h_S2_DMSO-3
## AGPAT3 1864 1835 2111 1717
## AHCYL1 2367 1716 1796 1565
## ANKLE2 1838 1690 1888 1656
## ANKRD17 2544 1480 1597 1297
## ANO6 2537 1532 1713 1376
## AP1G1 1781 1506 1681 1454
## 48h_S2_200nM17AAG-1 48h_S2_200nM17AAG-2 48h_S2_200nM17AAG-3 72h_DMSO-1
## AGPAT3 1789 1833 1670 2269
## AHCYL1 1990 2274 1825 2759
## ANKLE2 2108 2256 1744 1865
## ANKRD17 2304 2488 2159 2283
## ANO6 2476 2687 2555 2254
## AP1G1 1488 1751 1672 2233
## 72h_DMSO-2 72h_DMSO-3 72h_200nM17AAG-1 72h_200nM17AAG-2
## AGPAT3 2147 2467 1853 1764
## AHCYL1 2511 2919 2179 2077
## 7SK 0.00000 0.0000 0.0000 0.000000 0.00000
## A1BG 23.72684 0.0000 0.0000 2.972536 0.00000
## A1BG-AS1 0.00000 0.0000 0.0000 0.000000 0.00000
## A1CF 0.00000 0.0000 0.0000 0.000000 0.00000
## case5.lung1 case5.lung2 case5.lung3 case5.lung4 case5.fat1
## 5S_rRNA 17.8301501 204.14617 5.2150746 6.5310519 0.00000
## 5_8S_rRNA 9.5673976 47.63411 2.8248321 3.4832277 0.00000
## 7SK 0.8697634 0.00000 0.2172948 0.0000000 66.65604
## A1BG 0.0000000 0.00000 0.6518843 0.4354035 0.00000
## A1BG-AS1 3.4790537 0.00000 1.0864739 0.8708069 0.00000
## A1CF 0.0000000 0.00000 0.0000000 0.0000000 0.00000
## case5.lung5 case5.skin1 case5.bowel1 case5.liver1 case5.kidney1
## 5S_rRNA 4.7856618 1.8975379 4.2612716 261.001755 95.246937
## 5_8S_rRNA 1.5952206 0.3795076 0.4734746 66.169459 38.098775
## 7SK 0.0000000 1.8975379 0.4734746 0.000000 5.195287
## A1BG 0.1994026 0.3795076 0.0000000 7.352162 0.000000
## A1BG-AS1 0.1994026 0.0000000 0.2367373 3.676081 3.463525
## A1CF 0.0000000 0.0000000 0.0000000 102.930270 0.000000
## case5.heart1 case5.marrow1 NegControl1 NegControl2 NegControl3
## 5S_rRNA 6.0474605 9.419146 1.0790885 108.314409 4.0667302
## 5_8S_rRNA 3.9308493 7.326002 0.8992404 56.091390 1.3555767
## 7SK 0.3023730 0.000000 0.0000000 0.000000 1.5815062
## A1BG 0.9071191 0.000000 0.0000000 0.000000 0.2259295
## A1BG-AS1 0.6047461 1.046572 1.0790885 2.901279 1.1296473
## A1CF 0.3023730 0.000000 0.0000000 0.000000 0.0000000
## NegControl4 NegControl5 case6.lung1 case6.lung2 case6.lung3
## 5S_rRNA 14.1234049 26.5186080 80.5830895 129.256507 60.0512
## 5_8S_rRNA 4.4600226 4.1435325 24.7947968 33.142694 10.9184
## 7SK 0.9911161 0.8287065 0.0000000 0.000000 21.8368
## A1BG 0.2477790 0.0000000 0.0000000 3.314269 8.1888
## A1BG-AS1 1.4866742 0.0000000 0.0000000 0.000000 5.4592
## A1CF 0.2477790 0.0000000 0.8855285 36.456964 27.2960
## case6.lung4 case6.lung5 case7.lung1 case7.lung2 case7.lung3
## 5S_rRNA 0.00000 87.316422 51.187613 16.8835938 5.459087
## 5_8S_rRNA 30.51539 11.713179 13.650030 4.1068201 2.729543
## 7SK 42.72155 2.129669 5.118761 0.4563133 13.647717
## A1BG 48.82462 0.000000 8.531269 0.4563133 2.729543
## A1BG-AS1 18.30923 0.000000 1.706254 0.4563133 0.000000
## A1CF 109.85540 0.000000 34.125075 2.2815667 9.553402
## case7.lung4 case7.lung5 case8.lung1 case8.lung2 case8.lung3
## 5S_rRNA 96.976830 20.3623448 60.256169 250.998904 23.725512
## 5_8S_rRNA 11.409039 3.0543517 24.102468 59.447109 5.931378
## 7SK 0.000000 0.0000000 0.927018 0.000000 0.659042
## A1BG 5.704519 0.0000000 0.000000 0.000000 0.000000
## A1BG-AS1 0.000000 1.0181172 0.000000 3.302617 0.659042
## A1CF 5.704519 0.5090586 0.000000 0.000000 0.000000
## case8.lung4 case8.lung5 case8.liver1 case8.bowel1 case8.heart1
## 5S_rRNA 16.2110781 11.4957577 318.54089 42.04929 167.454051
## 5_8S_rRNA 0.8532146 2.8739394 75.84307 23.76699 50.236215
## 7SK 0.0000000 0.4105628 0.00000 0.00000 1.860601
## A1BG 0.0000000 0.0000000 34.12938 0.00000 1.860601
## A1BG-AS1 0.4266073 0.4105628 0.00000 0.00000 0.000000
## A1CF 0.0000000 0.0000000 284.41151 0.00000 0.000000
## case9.lung1 case9.lung2 case9.lung3 case9.lung4 case9.lung5
## 5S_rRNA 30.4508211 26.1531478 10.9642819 6.5572268 42.6334973
## 5_8S_rRNA 10.7846658 3.5342092 1.7684326 2.0983126 8.1987495
## 7SK 3.1719605 2.1205255 0.7073730 0.0000000 3.2794998
## A1BG 0.0000000 0.7068418 0.0000000 0.2622891 0.0000000
## A1BG-AS1 0.0000000 0.7068418 0.3536865 0.5245781 0.8198749
## A1CF 0.6343921 0.7068418 0.0000000 0.0000000 2.4596248
## case10.lung1 case10.lung2 case10.lung3 case11.lung1 case11.lung2
## 5S_rRNA 226.248915 69.321897 27.6902247 14.1541194 18.8911136
## 5_8S_rRNA 47.439289 22.218557 15.0318363 6.2907197 13.2937466
## 7SK 0.000000 0.000000 0.0000000 0.5242266 1.3993418
## A1BG 0.000000 0.000000 0.0000000 0.0000000 1.3993418
## A1BG-AS1 0.000000 0.000000 0.7911493 0.5242266 1.3993418
## A1CF 3.649176 1.777485 0.0000000 0.5242266 0.6996709
## case11.lung3 case11.kidney1 case11.bowel1 caseA.lung.NYC
## 5S_rRNA 64.7298859 145.718388 29.0709229 3.216111
## 5_8S_rRNA 13.9757708 35.080353 7.8100987 0.000000
## 7SK 1.4711338 2.698489 0.4338944 0.000000
## A1BG 0.7355669 0.000000 0.0000000 3.216111
## A1BG-AS1 2.2067007 0.000000 0.4338944 0.000000
## A1CF 0.0000000 10.793955 0.0000000 41.809449
## caseB.lung.NYC caseC.lung.NYC caseD.lung.NYC caseP1.placenta1
## 5S_rRNA 6.8112310 16.7622456 64.32178 17.470270
## 5_8S_rRNA 2.7864127 9.3123587 25.72871 4.031601
## 7SK 0.6192028 1.2416478 0.00000 0.000000
## A1BG 0.0000000 0.0000000 0.00000 0.000000
## A1BG-AS1 1.2384056 0.6208239 0.00000 0.000000
## A1CF 0.3096014 0.0000000 0.00000 1.343867
## caseP1.placenta2 caseP2.placenta1 caseP2.placenta2 caseP3.placenta1
## 5S_rRNA 12.483367 8.7605654 9.788657 3.4054256
## 5_8S_rRNA 3.640982 3.9422544 2.719072 1.0478233
## 7SK 8.322244 4.8183110 69.608231 117.8801161
## A1BG 1.040281 0.0000000 0.000000 0.2619558
## A1BG-AS1 0.000000 0.8760565 0.000000 0.2619558
## A1CF 0.000000 0.4380283 0.000000 0.0000000
## caseP3.placenta2 caseP4.placenta1 caseF.lung.NYC caseE.lung.NYC
## 5S_rRNA 16.042997 5.1152960 2.0354397 13.931242
## 5_8S_rRNA 5.347666 0.2692261 0.2907771 10.216244
## 7SK 52.504353 2.9614871 0.0000000 2.786248
## A1BG 0.000000 0.2692261 0.0000000 0.000000
## A1BG-AS1 0.000000 0.2692261 0.2907771 0.000000
## A1CF 0.000000 0.0000000 0.0000000 0.000000
## caseG.lung.NYC caseH.lung.NYC caseI.lung.NYC caseJ.lung.NYC
## 5S_rRNA 7.0430417 3.5559007 222.868159 8.437777
## 5_8S_rRNA 3.3143726 0.0000000 81.613692 7.500246
## 7SK 0.8285931 1.6163185 0.000000 2.343827
## A1BG 0.0000000 0.0000000 0.000000 0.000000
## A1BG-AS1 0.0000000 0.9697911 0.000000 0.000000
## A1CF 0.0000000 0.3232637 9.416964 0.000000
## case3.liver1 case10.liver1 case12.liver1
## 5S_rRNA 117.519303 156.222134 4907.10184
## 5_8S_rRNA 42.476857 32.888870 687.61937
## 7SK 1.415895 8.222218 0.00000
## A1BG 8.495371 0.000000 31.25543
## A1BG-AS1 4.247686 0.000000 0.00000
## A1CF 89.201399 73.999958 0.00000
dim(tissue_data.df)
## [1] 59090 88
# Read in some additional patient data
patient_data.df <- read_excel("./data/2020.07.30.20165241-supp_tables.xlsx", sheet=2)
# Take a quick look at it
head(patient_data.df)
## # A tibble: 6 x 18
##   `Case\r\nNo` `Viral\r\nhigh\r\nvs.\r\nviral\r\nlow*` Viral\r\nload\r\n(%\r\n~1
##   <chr>        <chr>                                   <chr>
## 1 1            High                                    81.2
## 2 2            Low                                     0.5
## 3 3            Low                                     2
## 4 4            Low                                     <0.01
## 5 5            High                                    18.5
## 6 6            Low                                     0.02
## # i abbreviated name:
## #   1: `Viral\r\nload\r\n(%\r\npositive\r\ntissue\r\narea by\r\nRNA\r\nISH)`
## # i 15 more variables: `ISH\r\nhyaline\r\nmembrane\r\nreactivity` <chr>,
## #   `RNA\r\nseq\r\ncoverage` <chr>, `qRT\r\nPCR**` <chr>,
## #   `Keratin\r\ncells /\r\nmm2` <chr>, `Napsin\r\nA cells\r\n/ mm2` <chr>,
## #   `CD1 63\r\ncells/\r\nmm2` <chr>, `CD 3\r\nCells/\r\nmm2` <chr>,
## #   `CD 4\r\ncells/\r\nmm2` <chr>, `CD8\r\ncells/\r\nmm2` <chr>, ...
dim(patient_data.df)
## [1] 24 18
Before we attempt one of our non-linear projection methods, let’s try using some of the methods we’ve spent all this time looking at. We’ll naively repeat our PCA steps on this tissue data and see if we can pull any information from it. We know there are 10 tissue groups in our data, so we’ll use that as a basis for clustering.
As we can see from the above plot, the majority of our samples from various tissue types fall into a single cluster. What’s more, as we’ll see, there is some structure in these samples, as multiple samples come from the same patient. Sometimes even the same tissue! These relationships may have a large effect on the overall variation we see in the data, making it hard to distinguish even between tissue types. This is why we’ll turn to non-linear projection and see if it can help us out. Before that we’ll need to do some more wrangling…
First let’s examine our patient_data.df, which has 24 patient samples (aka cases) listed in total. These 24 cases relate back to tissue_data.df, whose variables represent different combinations of case and tissue type, with observations for some 59K transcripts.
At this point we want to reformat the column names in
patient_data.df a bit before selecting for just the viral
load and viral load percentage information located in the 2nd and 3rd
column. We’ll hold onto this in patient_viral_load.df for
later use.
# Create a dataframe holding just the viral_load information for each patient
patient_viral_load.df <-
patient_data.df %>%
# Retain just the first 3 columns
dplyr::select(1:3) %>%
# Rename the 2nd and 3rd columns
dplyr::rename(case_num = 1, viral_load = 2, viral_load_percent = 3)
head(patient_viral_load.df)
We now want to format tissue_data.df much in the way we
did with our PHU data. We want to convert our current data which lists
genes as observations and tissue samples as columns. Essentially, we’d
like to transpose this and we could do that but the
transposition converts everything to a matrix. In the end, we want to
work with a data frame so we can hold more information.
To reduce memory use and runtime, however, we should trim our dataset. We aren’t really interested in looking at 59,090 transcripts, many of which may be barely expressed. Since these are normalized counts across the dataset, we can filter out low-expression genes to make tissue_data_filtered.df. Yes, this would again be considered a form of feature elimination.
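As a concrete sketch of this row-wise filtering, here is a toy version on an invented three-gene table (the gene names, sample columns, and values are made up for illustration; the in-class cell below leaves blanks for you to complete). Note that selecting the sample columns explicitly keeps the freshly created mean column out of the variance calculation.

```r
library(dplyr)
library(tibble)

# Hypothetical mini count table: 3 genes x 2 samples (values invented)
toy.df <- tibble(gene = c("g1", "g2", "g3"),
                 s1   = c(0.1, 2, 5),
                 s2   = c(0.2, 3, 3))

toy_filtered.df <- toy.df %>%
  # Set up the table to perform row-wise operations
  rowwise() %>%
  # Name the sample columns explicitly so the new `mean` column
  # can't leak into the variance calculation
  mutate(mean     = mean(c_across(s1:s2)),
         variance = var(c_across(s1:s2))) %>%
  # Drop low-expression genes
  filter(mean > 0.5) %>%
  # Highest-variance genes first
  arrange(desc(variance)) %>%
  ungroup()

toy_filtered.df$gene   # g1 is dropped; g3 sorts ahead of g2
```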
In general we will:
system.time(
# Trim the tissue data down
tissue_data_filtered.df <-
tissue_data.df %>%
# Convert the row names to an actual column
rownames_to_column(var="gene") %>%
# Set up the table to perform row-wise operations
rowwise() %>%
# Calculate the mean expression of each gene across all tissue samples
mutate(mean = ...(c_across(where(is.numeric)))) %>%
# Filter for samples with low expression
filter(mean > 0.5) %>%
# Calculate overall variance in case we need to make our dataset smaller
mutate(variance = ...(c_across(where(is.numeric)))) %>%
# Arrange samples by descending variance
arrange(desc(variance)) %>%
# Remove the grouping specification
ungroup()
)
## Error in `mutate()`:
## i In argument: `mean = ...(c_across(where(is.numeric)))`.
## i In row 1.
## Caused by error in `...()`:
## ! could not find function "..."
## Timing stopped at: 0.05 0 0.06
# Take a look at the final results
head(tissue_data_filtered.df)
## Error in head(tissue_data_filtered.df): object 'tissue_data_filtered.df' not found
# how big is our filtered data frame?
dim(tissue_data_filtered.df)
## Error in eval(expr, envir, enclos): object 'tissue_data_filtered.df' not found
Now that we’ve filtered our data down to ~29k genes, we’ll run through two more steps:
pivot_longer() and pivot_wider().
# You need to transpose the data.
# We can do it with dplyr to keep it as a data frame and to add some info
tissue_RNAseq.df <-
tissue_data_filtered.df %>%
# trim down the columns to drop mean/variance
dplyr::select(1:89) %>%
# pivot longer
pivot_longer(cols=c(2:89), names_to = ..., values_to = ...) %>%
# redistribute the gene names to columns
pivot_wider(names_from = ..., values_from = ...)
## Error in tissue_data_filtered.df %>% dplyr::select(1:89) %>% pivot_longer(cols = c(2:89), : '...' used in an incorrect context
# Take a quick look at the transposed result
head(tissue_RNAseq.df)
## Error in head(tissue_RNAseq.df): object 'tissue_RNAseq.df' not found
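To see the two pivots performing a transpose on something small, here is a toy table (the gene and sample names are invented for illustration):

```r
library(dplyr)
library(tidyr)

# A tiny stand-in: genes as rows, samples as columns
toy.df <- tibble::tibble(gene    = c("g1", "g2"),
                         sampleA = c(1, 2),
                         sampleB = c(3, 4))

toy_t.df <- toy.df %>%
  # Step 1: melt to one row per gene/sample pair
  pivot_longer(cols = -gene, names_to = "sample", values_to = "count") %>%
  # Step 2: spread the genes back out as columns, so samples become rows
  pivot_wider(names_from = gene, values_from = count)

toy_t.df   # columns: sample, g1, g2; one row per sample
```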
# We want to add some additional sample information before assessing the data
tissue_RNAseq.df <-
tissue_RNAseq.df %>%
# Grab just the sample names
dplyr::select(sample) %>%
# Grab information from it like case number, tissue, and tissue number
# takes the form of "caseX.tissueY" or "caseX.tissue.NYC" or "NegControlX"
# Remember that this returns a LIST of character matrices
str_match_all(., pattern=r"(case([\w]+)\.([a-z]+)([\d|\.NYC]*)|(NegControl\d))") %>%
# Bind all the matrices from all the list elements together in a single object (likely a matrix)
do.call(rbind, .) %>%
# Convert the results to a data frame
as.data.frame() %>%
# Rename the columns based on the capture groups
dplyr::rename(., sample = V1, case_num = V2, tissue = V3, tissue_num = V4, neg_num = V5) %>%
# Coalesce some of the info due to negative control samples and clean up a column
mutate(case_num = coalesce(case_num, neg_num),
tissue_num = str_replace_all(tissue_num, pattern = "\\.", replacement = "")) %>%
# Drop the neg_num column
dplyr::select(1:4) %>%
# Join this result to the RNAseq info
full_join(., y=tissue_RNAseq.df, by=c("sample" = "sample")) %>%
# Join that result to grab viral load information
right_join(patient_viral_load.df, y=., by=c("case_num" = "case_num"))
## Error in right_join(patient_viral_load.df, y = ., by = c(case_num = "case_num")): object 'patient_viral_load.df' not found
# Look at the resulting dataframe
head(tissue_RNAseq.df)
## Error in head(tissue_RNAseq.df): object 'tissue_RNAseq.df' not found
# How many tissue types do we have?
table(tissue_RNAseq.df$tissue)
## Error in table(tissue_RNAseq.df$tissue): object 'tissue_RNAseq.df' not found
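To make the capture groups in that pattern concrete, here is a small demonstration on a few hypothetical sample names in the formats described in the comments above:

```r
library(stringr)

# Hypothetical sample names in the three formats described above
nm <- c("case5.lung2", "NegControl3", "caseA.lung.NYC")

m <- str_match(nm, pattern = r"(case([\w]+)\.([a-z]+)([\d|\.NYC]*)|(NegControl\d))")

# Column 1 is the full match; columns 2-5 are the capture groups
# (case_num, tissue, tissue_num, neg control)
m[, 3]   # the tissue group: "lung" NA "lung"
```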
# Generate a matrix version of our data but drop the sample metadata!
tissue_RNAseq.mx <- as.matrix(tissue_RNAseq.df[, ...])
## Error in as.matrix(tissue_RNAseq.df[, ...]): object 'tissue_RNAseq.df' not found
# head(tissue_RNAseq.mx)
str(tissue_RNAseq.mx)
## Error in str(tissue_RNAseq.mx): object 'tissue_RNAseq.mx' not found
dim(tissue_RNAseq.mx)
## Error in eval(expr, envir, enclos): object 'tissue_RNAseq.mx' not found
Rtsne package
We now have a somewhat more complex dataset. We are still short on actual samples (now observations), but 88 observations and nearly 30K features isn’t so bad. A broad question we may wish to ask with such a dataset is whether there is an underlying structure to these samples - i.e. do we see grouping based on tissue type, or perhaps even sample preparation?
t-Distributed Stochastic Neighbour Embedding or t-SNE is a way for us to project our high-dimension data onto a lower dimension with the aim of preserving local similarities rather than global disparity. When looking at data points, t-SNE will attempt to preserve the local neighbouring structure. As the algorithm pores over the data, it can use different transformations for different regions as it attempts to transform everything to a lower dimension. It is considered “incredibly flexible” at finding local structure where other algorithms may not.
This flexibility is accomplished through 2 steps:
We’ll discuss more about how this algorithm affects interpretation after seeing the results, but this is considered an exploratory data visualization tool, rather than explanatory.
To produce our t-SNE projection we’ll use the Rtsne()
function from the package of the same name. Some important parameters
are:
X is our data matrix where each row is an
observation
dims sets the number of dimensions we’d like to
project onto (default is 2).
initial_dims sets the number of dimensions that
should be retained in the initial PCA step (Default 50).
perplexity a numeric parameter that tunes between
local and global aspects of your data.
This parameter is a guess as to how many close neighbours a point may have. If you have a sense of sample types (or clusters!) ahead of time, you could try to play with this value (default is 30).
According to the algorithm this value should follow this rule: \(perplexity * 3 \lt nrow(X) -1\)
pca_scale is a boolean to set if the initial PCA
step should use scaled data.
max_iter is the maximum number of iterations in the
algorithm (default is 1000).
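As a minimal, self-contained sketch of these parameters, here is a toy run on the built-in iris measurements rather than the course matrix (the seed and iteration count are arbitrary choices):

```r
library(Rtsne)

# Toy run on the built-in iris measurements (not the course data)
iris_unique <- unique(iris[, 1:4])   # Rtsne rejects duplicate rows
set.seed(42)                         # t-SNE is stochastic, so seed it

# The perplexity rule: perplexity * 3 < nrow(X) - 1
tsne_demo <- Rtsne(as.matrix(iris_unique),
                   dims = 2,
                   perplexity = 30,
                   max_iter = 500)

dim(tsne_demo$Y)   # one (x, y) coordinate pair per observation
```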
# Rtsne prefers using a matrix for memory issues
# set a seed for reproducible results
set.seed(2025)
# The default perplexity of 30 is too high here; it can go as high as 29 before Rtsne errors out
# We have just 88 samples, but between 1-52 samples per "group"
tissue_tsne <- Rtsne(...,
dims=2,
perplexity=5,
verbose=TRUE,
pca_scale = TRUE,
max_iter = 1500)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# What does our t-SNE object look like?
str(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
Looking above at the result of our t-SNE analysis, we can notice a few things…
Y is an 88x2 matrix holding the coordinates for our new projection.
That’s right, there is no underlying information for mapping back to our original dataset. It’s a completely black box with no way to reverse engineer the process. That’s because the process itself is stochastic! Whereas PCA was a deterministic process - repeatable the same way every time with the same data - that is not the case for t-SNE. That’s why even though it can be quite powerful in identifying local similarities, t-SNE does not provide a mathematical pathway back to our original data!
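To see the contrast with a deterministic method, two runs of base R’s prcomp() on the same data always agree (shown here on the built-in iris data):

```r
# PCA is deterministic: same input, same output, every time
p1 <- prcomp(iris[, 1:4], scale. = TRUE)
p2 <- prcomp(iris[, 1:4], scale. = TRUE)

identical(p1$x, p2$x)   # TRUE
```

An unseeded Rtsne() call, by contrast, returns different coordinates on every run, which is why we call set.seed() before running it.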
Let’s extract the information, combine it with our sample information
and project it using ggplot2. We can do this with a
scatterplot since we have a set of x/y coordinates we can work with.
# Build our new data frame from the Y values
tissue_tsne.df <- data.frame(x.coord = ...,
y.coord = ...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Add in our sample information
tissue_tsne.df <- cbind(tissue_RNAseq.df[,1:6], tissue_tsne.df)
## Error in cbind(tissue_RNAseq.df[, 1:6], tissue_tsne.df): object 'tissue_RNAseq.df' not found
# Fix up the information just a little bit to remove NA viral load information
tissue_tsne.df <-
tissue_tsne.df %>%
# replace NAs with DNW (did not work)
mutate(viral_load = replace_na(viral_load, replace = "DNW"))
## Error in mutate(., viral_load = replace_na(viral_load, replace = "DNW")): object 'tissue_tsne.df' not found
head(tissue_tsne.df)
## Error in head(tissue_tsne.df): object 'tissue_tsne.df' not found
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))
# 1. Data
ggplot(data = tissue_tsne.df) +
# 2. Aesthetics
aes(x = x.coord,
y = y.coord,
colour = ...) +
# Themes
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_manual(values = combo.colours) +
# 4. Geoms
geom_text(aes(label = case_num), size = 10)
## Error in ggplot(data = tissue_tsne.df): object 'tissue_tsne.df' not found
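The same plotting pattern can be seen on a small invented data frame (all column values here are hypothetical, just to show the ggplot2 layering):

```r
library(ggplot2)

# Hypothetical coordinates, standing in for a t-SNE result
set.seed(1)
demo.df <- data.frame(x.coord  = rnorm(10),
                      y.coord  = rnorm(10),
                      tissue   = rep(c("lung", "liver"), each = 5),
                      case_num = 1:10)

p <- ggplot(data = demo.df) +
  # Map coordinates to position and tissue to colour
  aes(x = x.coord, y = y.coord, colour = tissue) +
  theme_bw() +
  # Label each point with its case number
  geom_text(aes(label = case_num), size = 10)
p
```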
While we don’t have a lot of samples, you can still see that we were able to cluster some of our data by cell types without providing that classification to the algorithm! Great job team!
We can see that we get close clustering of tissues like placenta, heart, and bowel. Our liver samples are kind of everywhere but perhaps using a different perplexity would provide different results.
One interesting thing we can see is that regardless of tissue type, we see some samples are clustering together based on case number - namely case numbers 1, 4 and 5 seem to have some strong underlying expression profiles that connect them across tissue samples. We may also be seeing false relationships so beware!
This algorithm for projection, UMAP (Uniform Manifold Approximation and Projection), is in the same flavour as t-SNE but has some differences, including:
What does that mean for us? Faster results, and more interpretable results! Otherwise the same issues can apply. The setup is slightly easier, with few options to change if you keep the defaults. We can access umap() from the package of the same name. You may also alter the default settings by creating a umap.defaults object. More information on that here
For more tinkering, you can choose to use the uwot package instead where the
umap() function has more options that are easily
modified.
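A minimal run of umap() looks like the following, shown here on the built-in iris measurements rather than the course data:

```r
library(umap)

# Minimal run on the built-in iris measurements (not the course data)
set.seed(2025)
iris_umap <- umap(as.matrix(iris[, 1:4]))

# The coordinates we plot live in the layout element
dim(iris_umap$layout)   # an n x 2 coordinate matrix
```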
# Set our seed
set.seed(2025)
# Generate our projection
tissue_umap <- ...(tissue_RNAseq.mx)
## Error in ...(tissue_RNAseq.mx): could not find function "..."
# What does the UMAP object look like?
str(tissue_umap)
## Error in str(tissue_umap): object 'tissue_umap' not found
Looking at our UMAP object tissue_umap we see it houses
the projection parameters used but also some additional variables:
data: holds our original data matrix.
layout: contains the projection coordinates we need for plotting the data.
knn: a weighted k-nearest neighbour graph. This is a graph that connects each observation to its nearest k neighbours. This generates the first topological representation of the data - like an initial sketch.
You may notice again that there is no data that suggests how we arrived at this solution. There are no eigenvectors or values to reverse the projection!
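If you do want to tweak the defaults mentioned above, one sketch is to copy and edit umap.defaults before passing it in (the parameter values chosen here are arbitrary, for illustration only):

```r
library(umap)

# Start from the package defaults and modify a copy
custom.config <- umap.defaults
custom.config$n_neighbors  <- 10    # default is 15
custom.config$random_state <- 2025  # fixes the RNG for reproducibility

# Then supply it to the call, e.g.:
# tissue_umap2 <- umap(tissue_RNAseq.mx, config = custom.config)
custom.config$n_neighbors
```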
Let’s extract the layout information, combine it with
our sample information and project it using ggplot2.
# Re-map our projection points with our tissue data
tissue_umap.df <- data.frame(x.coord = tissue_umap$...,
y.coord = tissue_umap$...)
## Error in data.frame(x.coord = tissue_umap$..., y.coord = tissue_umap$...): object 'tissue_umap' not found
tissue_umap.df <- cbind(tissue_RNAseq.df[,1:6], tissue_umap.df)
## Error in cbind(tissue_RNAseq.df[, 1:6], tissue_umap.df): object 'tissue_RNAseq.df' not found
tissue_umap.df <-
tissue_umap.df %>%
# replace NAs with DNW (did not work)
mutate(viral_load = replace_na(viral_load, replace = "DNW"))
## Error in mutate(., viral_load = replace_na(viral_load, replace = "DNW")): object 'tissue_umap.df' not found
head(tissue_umap.df)
## Error in head(tissue_umap.df): object 'tissue_umap.df' not found
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))
# 1. Data
ggplot(data = ...) +
# 2. Aesthetics
aes(x = x.coord,
y = y.coord,
colour = tissue) +
# Themes
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_manual(values = combo.colours) +
# 4. Geoms
geom_text(aes(label = case_num), size = 10)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
So it looks like without much tinkering we retrieved a fairly nice result. We can see now that liver and kidney are more closely grouped as tissues, while heart samples generally cluster together still. Bowel and jejunum appear spatially grouped and our placenta samples are still close to each other. The clustering we saw with samples 1, 4, and 5 appear to be less severe.
There does appear to be some structure between the lung samples in different case numbers so this might be an avenue to explore next to try and see if there truly is a relationship between these groups.
While t-SNE and UMAP produce projections onto a 2-dimensional plane, they do not strictly cluster the data. There is no labelling produced, and you have no route back to understanding the relationships between closely plotted points. PCA, on the other hand, is strictly a dimension reduction tool. It does not place or assign data points to any groups, BUT it is useful to run on large datasets prior to clustering!
Today we took a deep dive into principal component analysis. There are of course different variants of this based on the assumptions you can make about your observations and variables like independent component analysis (ICA, non-Gaussian features) and multiple correspondence analysis (MCA, categorical features). Some additional methods can also be used to store the transformation like a PCA does, notably principal coordinate analysis (PCoA) and variational autoencoders (VAE).
Overall we should remember that while PCA can have problems in its feature extraction, it is deterministic and repeatable. Also, the final results are provided in such a way that new observations could be transformed and projected onto the same principal components. You can also feed these components back into clustering algorithms like k-means to try and identify specific subgroups.
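As a quick sketch of that last idea, here is k-means run on the first two principal components of the built-in iris data (not the course data; the number of centres and nstart value are arbitrary choices):

```r
# Project onto the first two PCs, then cluster in that reduced space
pcs <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:2]

set.seed(2025)
km <- kmeans(pcs, centers = 3, nstart = 25)

# Compare the recovered clusters to the known species labels
table(cluster = km$cluster, species = iris$Species)
```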
t-SNE and UMAP, on the other hand, appear to do a much better job with high-dimensional data. They can preserve local structure, and UMAP can also do a fairly good job of preserving global structure. These tools make for great exploratory analysis of your complex datasets. Interpretation of relationships, however, is not mathematically clear as it is in PCA. These are, after all, projections from a higher dimension for our simpler primate brains!
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | Weight | Criteria |
|---|---|---|
| Code | 50% | Does it follow best practices? Does it make good use of available packages? Was data prepared properly? |
| Answers and Output | 50% | Is output based on the correct dataset? Are groupings appropriate? Are titles/axes/legends correct? Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Extra marks may be awarded for well-coded or elegant solutions.
You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: April 17th, 2025.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.1: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
More information on calculating optimal clusters: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
Step-by-step how PCA works: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
More PCA explanations here: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
The math of PCA: https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
t-SNE in R and Python: https://datavizpyr.com/how-to-make-tsne-plot-in-r/
All about UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.